DETAILED ACTION
Claim Rejections - 35 USC § 102 

1. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that
form the basis for the rejections under this section made in this Office action: 

A person shall be entitled to a patent unless - 
(e) the invention was described in- 

(1) an application for patent, published under section 122(b), by another filed in the United States before the invention 
by the applicant for patent; or 

(2) a patent granted on an application for patent by another filed in the United States before the invention by the 
applicant for patent, except that an international application filed under the treaty defined in section 351(a) shall have
the effects for the purposes of this subsection of an application filed in the United States only if the international
application designated the United States and was published under Article 21(2) of such treaty in the English language.

2. Claims 1, 3, 5-7 and 9-11 are rejected under 35 U.S.C. 102(e) as being anticipated by
van der Schaar et al, US 2002/0006161. 

Re claim 1, van der Schaar et al discloses a method for encoding frames of input
video (fig. 3a), comprising the steps of: 

processing said input video ("original video", 106) to produce a compressed base 
layer bit stream (110); 

processing said input video to produce a compressed enhancement layer bit 
stream (150 - ); 

identifying a region of interest in a video frame (section 0023); 

and enhancing the quality of the region of interest by providing additional bits for 
coding said region (section 0023, in this segment, van der Schaar et al discusses 
transmitting "designated areas" within an image at a higher priority than other areas of 
the image. In other words, designated areas are regions of interest coded at
a higher priority, i.e. with more bit allocation, than other areas of the image).

Re claim 3, the method as defined by claim 1, wherein said step of providing
additional bits for coding said region comprises providing additional bits for said region
in the compressed enhancement layer bit stream (w/r to discussion in claim 1, also see
sections 0023-0024).

Re claim 5, the method as defined by claim 3, wherein said processing to 
produce a compressed enhancement layer bit stream includes a bit plane shifting step, 
and wherein said step of providing additional bits for said region includes increasing the 
bit shifting values in said region (sections 0025-0026). 

Re claim 6, the method as defined by claim 1, wherein said step of processing
said input video to produce a compressed base layer bit stream includes forming motion 
vectors, and wherein said step of identifying a region of interest in a video frame 
includes basing said identifying on said motion vectors (fig. 3a: "motion 
estimation/compensation", also sections 0037-0038; in these segments, the positions of
areas of interest are inherently provided by the motion estimation, which is used to
detect motion vectors. It is noted that '161 is for MPEG-4, which is the coding protocol 
for image object segmentation or region-of-interest). 

Re claim 7, the method as defined by claim 3, wherein said step of processing 
said input video to produce a compressed base layer bit stream includes forming motion 
vectors, and wherein said step of identifying a region of interest in a video frame 
includes basing said identifying on said motion vectors (the limitations have been analyzed
and rejected w/r to claim 6 above).

Re claim 9, the method as defined by claim 6, wherein said step of identifying a 
region of interest in a video frame based on said motion vectors includes basing said 
identification on the magnitude of motion vectors (w/r to claim 6; also, it is inherent that
motion estimation disclosed in '161 not only detects motion vectors, but also provides
identification on the magnitude of motion vectors because motion estimation needs to 
know the magnitude of motion vectors to identify the optimum motion vector for 
compensation). 

Re claim 10, the method as defined by claim 6, wherein said step of identifying a 
region of interest in a video frame based on said motion vectors includes basing said 
identification on the intensity change of neighboring regions based on motion vectors 
(w/r to claims 6 and 9; also, it is inherent that motion estimation disclosed in '161 evaluates
motion vectors based on identification on the intensity change of neighboring regions 
because this step has to take place in order to determine the minimum motion vector). 

Re claim 11, the method as defined by claim 3, wherein said step of processing
said input video to produce a compressed base layer bit stream includes forming motion 
vectors and determining motion compensation values, and wherein said step of 
identifying a region of interest in a video frame includes basing said identifying on said 
motion vectors and said motion compensation values (the limitations have been 
analyzed w/r to claims 6, 9 and 10, furthermore fig. 3a shows motion estimation and 
compensation are involved). 
Claim Rejections - 35 USC § 103

3. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and 
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made. 

4. Claims 2, 4, 8 and 12 are rejected under 35 U.S.C. 103(a) as being unpatentable 
over van der Schaar et al, US 2002/0006161 (hereinafter '161) as applied to claim 1 
above and further in view of van der Schaar et al, US 6,501,797 (hereinafter '797).

Re claim 2, van der Schaar et al '161 discloses providing additional bits for 
coding designated areas or region of interest for enhancement, but fails to disclose 
whether said region comprises providing additional bits for said region in the 
compressed base layer bit stream as claimed. 

van der Schaar et al '797 discloses providing additional bits for said region in the
compressed base layer bit stream (fig. 3: 322, col. 7, line 39 to col. 8, line 19; in this
segment, van der Schaar et al discusses how the output signals of the base layer adjust or
monitor the operation of the enhancement rate allocator circuit "358").

Taking the combined teaching of van der Schaar et al '161 and '797 as a whole,
it would have been obvious to implement providing additional bits for coding designated
areas or region of interest for enhancement layer coding by providing additional bits for
said region in the compressed base layer bit stream as claimed for the benefit of
improving coding efficiency of the enhancement layer coding and improving image
quality ('797 col. 2, line 60 to col. 3, line 11).
Re claim 4, the method as defined by claim 2, wherein said processing to
produce a compressed base layer bit stream includes a quantization step, and wherein
said step of providing additional bits for said region includes decreasing the quantization
step in said region (w/r to discussion in claim 2; furthermore, base layer allocator "322"
regulates the bit amount by controlling the quantization "316". Decreasing the quantization
step translates to an increased bit amount, and vice versa).

Re claim 8, the method as defined by claim 4, wherein said step of processing 
said input video to produce a compressed base layer bit stream includes forming motion 
vectors, and wherein said step of identifying a region of interest in a video frame 
includes basing said identifying on said motion vectors (the limitations have been analyzed
and rejected w/r to claim 6 above. Claim 2 provides reasons/motivation for combined 
teaching). 

Re claim 12, the method as defined by claim 4, wherein said step of processing 
said input video to produce a compressed base layer bit stream includes forming motion 
vectors and determining motion compensation values, and wherein said step of 
identifying a region of interest in a video frame includes basing said identifying on said 
motion vectors and said motion compensation values (the limitations have been 
analyzed and rejected w/r to claim 11. Claim 2 provides reasons/motivation for 
combined teaching). 
Contact

5. Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to Vu Le whose telephone number is 703-308-6613. The 
examiner can normally be reached on M-F 8:30-6:00. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Chris Kelley can be reached on 703-305-4856. The fax phone number for 
the organization where this application or proceeding is assigned is 703-872-9306. 

Information regarding the status of an application may be obtained from the 
Patent Application Information Retrieval (PAIR) system. Status information for 
published applications may be obtained from either Private PAIR or Public PAIR. 
Status information for unpublished applications is available through Private PAIR only. 
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should 
you have questions on access to the Private PAIR system, contact the Electronic 
Business Center (EBC) at 866-217-9197 (toll-free). 
Spatially Scalable Video Compression
Employing Resolution Pyramids

Klaus Illgner and Frank Müller



Abstract — In this paper, a spatially scalable video coding 
scheme for low bit rates is proposed. The codec is especially well 
suited for communications applications because it is based on 
motion-compensated predictive coding which provides the neces- 
sary low-delay property. The frames to be coded are decomposed 
into a Gaussian pyramid. Motion estimation and compensation 
are performed between corresponding pyramid levels of succes- 
sive frames. We show that, to fulfill specific needs of spatial 
scalability, the motion compensation on each level must result in 
compatible prediction errors (displaced frame differences, DFD). 
Compatibility of the prediction errors means that the pyramid 
formed by independently obtained DFD's (the DFD pyramid) 
is close to a Gaussian pyramid decomposition of the DFD of 
the highest resolution level. From the DFD pyramid, a least 
squares Laplacian pyramid is derived, which is quantized and 
coded. The DFD encoder outputs an embedded bit stream. Thus, 
the coder control may truncate the bit stream at any point, 
and can keep a fixed rate. The motion vector fields obtained at 
the different resolution levels are also encoded by employing a 
pyramid approach. Simulation results show that the proposed 
coder achieves a coding gain compared to simulcast coding. 

Index Terms — Laplacian pyramid, multiresolution pyramid, 
scalable motion compensation, spline fitter, video compression. 



I. Introduction 

The rapid development of networks leads to increasing interest
in video communications. All current standards for
communications-related video compression (H.261, H.263) are 
based on a single resolution hybrid (i.e., motion-compensated 
predictive) coding scheme. However, bandwidth limitations, 
multipoint operation with receivers of different capabilities, 
and bandwidth-dependent charging are good reasons to use 
scalable coding schemes. Also, MPEG-4 announced strong 
demands for scalable video coding algorithms. 

The term scalability is ambiguous because it is used in a 
spatial, temporal, or SNR context. In this paper, the focus is 
on spatial multiresolution schemes since this is an essential 
prerequisite to provide the property of spatial scalability. This 
means that the receiver can reconstruct frames of smaller 
spatial resolution using only a subset of the complete bit 
stream, thus fulfilling the above-mentioned demands. 

In search of an appropriate coding principle, spatiotemporal 
subband or wavelet approaches are problematic because the 
coding delay must be kept as small as possible for communica- 
tions applications. A promising approach is therefore to extend 
the classical hybrid coding principle, which is the basis of 

Manuscript received September 9, 1996; revised July 2, 1997. 
The authors are with the Institut für Elektrische Nachrichtentechnik, RWTH
Aachen, 52056 Aachen, Germany. 

Publisher Item Identifier S 0733-8716(97)07705-6. 



most coding schemes for communications, to fulfill the needs 
of spatial scalability. 

Known multiresolution approaches based on the hybrid 
coding principle utilize several DPCM loops on different 
resolution levels. The design objective of a spatially scalable 
video codec is to find optimal predictions for each resolution 
level. However, this task cannot be solved straightforwardly 
because the prediction errors have to be coded efficiently. 
This can be accomplished only by exploiting the dependencies 
between the different resolution levels. The degree of coupling 
between the DPCM loops makes for the main difference 
between various schemes. 

A straightforward approach is to decompose the frames into 
multiple resolution frames, and to code each resolution level 
independently. However, this approach results in significant 
coding overhead. The scalable codec described in [1] is 
based on MPEG-II, and employs several DPCM loops on 
different resolution levels. There exists a slight interconnection 
between the layers since motion estimation is performed in a 
hierarchical fashion. 

A tighter interconnection is realized in [2], where a two- 
layer scalable pyramid codec based on two DPCM loops 
is described. Hierarchical motion estimation is employed in 
combination with a hierarchical VQ on the displaced frame 
differences. There exist also various two-dimensional (2-D) 
wavelet-based and subband schemes [3]-[5]. Motion esti- 
mation and compensation turn out to be difficult in the 
wavelet/subband domain. One difficulty arises from the shift 
variance of downsampling in the wavelet domain [3]. Another 
reason is decreased efficiency of motion compensation on 
bandpass frames since the high-pass signal components are 
disturbed due to noise and aliasing. Therefore, most ap- 
proaches perform motion compensation on low-pass subbands. 

In this paper, a hybrid spatially scalable video coding 
algorithm is presented which is based on a Gaussian pyra- 
mid decomposition. Motion estimation and compensation are 
performed between each pyramid level such that the resulting 
prediction errors are close to a Gaussian pyramid decomposi- 
tion of the prediction error at the highest resolution level. For 
coding of the displaced frame differences (DFD) as well as 
the motion vector fields, the statistical dependencies between 
the pyramid levels (corresponding to the different resolutions) 
are utilized by coding jointly all resolutions with a compact 
Laplacian pyramid representation. More specifically, the DFD 
is decomposed into a centered Laplacian least squares spline 
pyramid (construction and properties of this kind of pyramid 
are derived in this paper), and subsequently coded in an 
embedded fashion. 
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One of the most efficient embedded image coding algo- 
rithms has been developed by Shapiro [6], and is called 
the embedded zero-tree wavelet (EZW) coding algorithm. 
It consists of a transformation of the image into a wavelet 
domain, and subsequent embedded coding with the zero-tree 
algorithm, which starts with coarse quantizations and refines 
the wavelet coefficients in subsequent passes. Taubman [7] 
developed a three-dimensional (3-D) subband video coder, 
which (restricted to 2-D) resembles the EZW coder in the 
aspects of decomposition and layered quantization. Instead of 
using zero-trees for coding of significance maps, he used a 
conditioning context to predict zeros in dominant passes. He 
shows that this kind of general prediction context provides 
significant gain over the zero-tree coding method. In both 
cases, a wavelet decomposition has been used. In a previous 
work [8], we adapted the EZW coding scheme to a Laplacian 
pyramid decomposition, and discussed the usage of conditional 
arithmetic coding in a Laplacian pyramid. We showed there 
that the conditional arithmetic coder outperforms the pyramid 
zero-tree coder. 

In this paper, we adapt the Laplacian pyramid coding of dis- 
placed frame differences to a spatially scalable coding scheme. 
We have to deal with two DFD's per frame, corresponding 
to the base layer and the refinement layer of the codec. The
DPCM loops for both layers are coupled such that the DFD 
in the base layer is "close" to a reduced version of the DFD 
of the refinement layer. This requires a careful design of the 
filter operators which accomplish the transition between the 
different resolutions. Requirements for these filters are given 
in this paper, together with a new filter design which fulfills 
these requirements. 

The paper is organized as follows. In the next section, 
an overview of the hybrid spatially scalable coding concept 
is given. The third section discusses motion compensation 
for scalable video coding in general, and mentions partic- 
ular implementation issues of the proposed scheme. In the 
fourth section, desirable properties for the reduce and expand 
operators (which define the pyramid decompositions) are sum- 
marized. While most of these properties can be fulfilled with 
any least squares pyramid [9], employment in the scalable 
codec makes the centering a desired feature. The derivation 
of such pyramids is included in the Appendix. The fifth and 
sixth sections describe DFD coding and vector field coding 
in detail, and the last two sections are devoted to simulation 
results and some conclusions. 



II. Design of a Spatially Scalable Coding Scheme 

We start with a discussion of the motion-compensated
predictive coder shown in Fig. 1. The current frame is denoted
by $g_n$ and the prediction of $g_n$ by $\hat{g}_n$. The DFD
$d_n = g_n - \hat{g}_n$ is decomposed into a Gaussian pyramid. A Laplacian
pyramid is derived from the Gaussian pyramid and coded
in embedded fashion using arithmetic coding. The prediction
$\hat{g}_n$ is calculated as a motion-compensated version of the
previously reconstructed frame $\tilde{g}_{n-1}$ using overlapping blocks.
For motion estimation between the frames $g_n$ and $\tilde{g}_{n-1}$, a
gain-cost criterion is employed. The motion vector field $v_n$ is
coded losslessly in a hierarchical fashion [10]. Fig. 2 shows (see footnote 1)
that this coding approach, which is termed PYRACO, reaches
approximately the same coding performance as the currently
best reference coder H.263 (TMN-5 implementation) [11]. An
important difference is that PYRACO outputs a constant bit
rate per frame, which simplifies coder control and reduces
coding delay, while TMN generates a significantly varying
rate.

Fig. 1. Block diagram of PYRACO, a predictive multiresolution video coding
scheme (nonscalable version).

Fig. 2. PSNR of sequence foreman coded at 48 kbit/s and 5 fps (fixed),
plotted over the frame number for PYRACO and H.263 (TMN-5).

The multiresolution image representation as a Gaussian 
pyramid offers a natural design choice for a spatially scalable 
coding scheme with multiple resolution levels. For simplic- 
ity of presentation, we consider in the following only two 
resolution levels. A block diagram of the spatially scalable 
hybrid encoder is depicted in Fig. 3. The base layer coder 
emits the lower spatial resolution, and follows the same coding 
principle as the coder in Fig. 1. A temporally predictive coding
approach is also used for encoding of the refinement layer,
which provides the higher spatial resolution. 



1 The test sequence has been obtained from the CCIR sequence using spline
filters [9] for preserving sharpness. The TMN codec runs with all options
except PB frames.
Fig. 3. Encoder of a two-level scalable predictive video coding scheme (the base layer is shaded).

Fig. 4. Relations between two resolutions in a predictive video coding scheme.



A. Aim for Motion Compensation to Achieve Scalability 

The first step is a decomposition of the current frame $g_n$
into a Gaussian pyramid $G_n$ of two levels (see footnote 2):

$$G_n = \left( g_n^{(0)}, g_n^{(1)} \right), \qquad g_n^{(0)} = g_n, \quad g_n^{(1)} = R\big(g_n^{(0)}\big). \tag{1}$$

$R(\cdot)$ denotes the so-called reduce operator, which consists
of a linear low-pass filter followed by subsampling of the
image in each dimension by a factor of 2. The aim is now
to employ a predictive video compression scheme for both
resolution levels of the frame to be coded. Hence, both levels
are decomposed into a predicted frame and the prediction error,
which is depicted in Fig. 4 reading the figure horizontally.
However, reading Fig. 4 vertically, the prediction errors of
both resolution levels form a pyramid. This pyramid can be
coded efficiently [8] if it is a Gaussian-type pyramid.

2 Throughout the paper, we use the convention that level 0 corresponds to
the refinement layer, and level 1 to the base layer.

Hence, the idea for the design of a spatially scalable predictive
video compression scheme is to calculate predictions
$\hat{g}_n^{(k)}$ for each level $k$, with prediction errors

$$d_n^{(k)} = g_n^{(k)} - \hat{g}_n^{(k)}, \qquad k = 0, 1,$$

such that the relation

$$R\big(d_n^{(0)}\big) = d_n^{(1)} \tag{2}$$

holds, which is equivalent to the condition

$$R\big(\hat{g}_n^{(0)}\big) = \hat{g}_n^{(1)}. \tag{3}$$

The predicted levels are obtained by motion compensation of
the corresponding level of the previous frame:

$$\hat{g}_n^{(k)} = \mathrm{MC}\big(\tilde{g}_{n-1}^{(k)}, v_n^{(k)}\big).$$

If the predictions satisfy (3), the prediction errors in Fig. 4
indeed constitute a Gaussian pyramid $D_n$.
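
The equivalence of (2) and (3) follows from the linearity of the reduce operator together with the pyramid relation between the two levels of the current frame. A short derivation is sketched here; the hat notation for the predictions is our rendering of the source's symbols and is an assumption:

```latex
\begin{aligned}
R\big(d_n^{(0)}\big)
  &= R\big(g_n^{(0)} - \hat{g}_n^{(0)}\big)
   = R\big(g_n^{(0)}\big) - R\big(\hat{g}_n^{(0)}\big)
   = g_n^{(1)} - R\big(\hat{g}_n^{(0)}\big), \\
d_n^{(1)} &= g_n^{(1)} - \hat{g}_n^{(1)}, \\
\text{hence}\quad R\big(d_n^{(0)}\big) = d_n^{(1)}
  &\iff R\big(\hat{g}_n^{(0)}\big) = \hat{g}_n^{(1)}.
\end{aligned}
```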

In a spatially scalable coding scheme, the prediction for 
the base layer needs to be calculated without considering the 
higher resolution level. This constraint is the central difficulty 
in the design. 

Since the operator $\mathrm{MC}(\cdot)$ is nonlinear and space variant,
motion compensation and low-pass filtering cannot be interchanged
in general. Furthermore, subsampling is a shift-variant
operation causing additional aliasing errors. Therefore, the 
low-pass filtered prediction from level k = 0 differs from the 
prediction on level k = 1, even if the motion is known exactly 
at level 0 [13]. Hence, equality in (3) cannot be obtained in 
general. 

In order to achieve high compression efficiency, the aim
is to find predictors for each scalable level such that (3) is
approximated sufficiently accurately:

$$R\big(\hat{g}_n^{(0)}\big) \approx \hat{g}_n^{(1)}. \tag{4}$$
The design objective is to make the prediction of the base layer
similar to the reduced prediction of the higher resolution level.
The approach followed in this paper calculates the base layer
prediction on an expanded frame, and subsequently applies a
reduce operator. It is shown in Section III that by appropriately
selecting the expand and reduce filters, (4) can be achieved
sufficiently accurately. It turns out that the same filters as for
the DFD pyramid coding are suitable.

B. Pyramid Decomposition of the Base Layer DFD 

The base layer DFD is decomposed into a Laplacian 
pyramid using centered cubic splines (see Appendix B), which 
results in high energy concentration into the higher pyramid 
levels. The decomposition follows the Gaussian-Laplacian 
pyramid approach [12]. First, an intermediate pyramid structure,
the Gaussian DFD pyramid $D'$, is generated. The pyramid
is initialized at the lowest level, level 1, with the original
base layer DFD we want to code. From this level, a coarse
approximation is derived by applying a reduce operator $R(\cdot)$.
This procedure is iterated, resulting in the base layer Gaussian
pyramid

$$D' = \left\{ \big(d^{(1)}, \ldots, d^{(L-1)}\big) \;\middle|\; d^{(k)} = R\big(d^{(k-1)}\big),\ k = 2, \ldots, L-1 \right\} \tag{5}$$

consisting of $L - 1$ levels (see footnote 3).

For computation of the corresponding Laplacian pyramid, an
expand operator $E(\cdot)$ is defined, which predicts a level of the
Gaussian pyramid from the next higher level, thus expanding
the size again by a factor of 2 for each dimension. The expand
operator consists of upsampling and subsequent filtering with
a linear interpolation filter. The Laplacian pyramid $L'$ captures
the loss of information occurring through the reduction of
the spatial resolution by means of the difference between a
Gaussian image at level $k$ and the expanded version of the
Gaussian image at level $k + 1$:

$$L' = \big(l^{(1)}, \ldots, l^{(L-1)}\big) \tag{6}$$

where

$$l^{(k)} = d^{(k)} - E\big(d^{(k+1)}\big), \qquad k = 1, \ldots, L-2.$$

The top level of $L'$ is "initialized" with the top level of $D'$:

$$l^{(L-1)} = d^{(L-1)}.$$

Knowledge of the Laplacian pyramid is sufficient to reconstruct
the base layer DFD $d^{(1)}$. Thus, an encoder can build
up a coarse representation and successive refinements (the
Laplacian levels) of an image, and a decoder can reconstruct
the original from this information.
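
The decomposition and its inversion can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: the reduce operator is replaced by a 2x2 block mean and the expand operator by pixel replication, not the centered cubic spline filters of the paper, and all function names are hypothetical.

```python
import numpy as np

def reduce_(img):
    """Stand-in reduce operator R: 2x2 block mean, then subsample by 2."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def expand(img):
    """Stand-in expand operator E: upsample by 2 via pixel replication."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def gaussian_pyramid(d, levels):
    """Return [d^(1), ..., d^(levels)] with d^(k) = R(d^(k-1))."""
    pyr = [d]
    for _ in range(levels - 1):
        pyr.append(reduce_(pyr[-1]))
    return pyr

def laplacian_pyramid(gauss):
    """l^(k) = d^(k) - E(d^(k+1)); the top level is copied unchanged."""
    lap = [g - expand(g_next) for g, g_next in zip(gauss[:-1], gauss[1:])]
    lap.append(gauss[-1])
    return lap

def reconstruct(lap):
    """Invert the decomposition, starting from the top (coarsest) level."""
    d = lap[-1]
    for l in reversed(lap[:-1]):
        d = l + expand(d)
    return d

rng = np.random.default_rng(0)
dfd = rng.standard_normal((16, 16))      # synthetic stand-in for a base layer DFD
lap = laplacian_pyramid(gaussian_pyramid(dfd, levels=3))
recon = reconstruct(lap)
```

Reconstruction only needs the expand operator, which matches the observation in the text that the decoder does not have to know how the reduction was performed.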

3 According to our convention, the complete pyramid consists of $L$ levels,
and therefore the base layer pyramid has $L - 1$ levels.



C. Approach for DFD and Vector Field Coding 

As is depicted in the block diagram of the spatially scalable 
hybrid encoder in Fig. 3, motion estimation and compensation 
are performed on the resolution levels 0 and 1, yielding a 
prediction error for the base and the refinement layer. The base 
layer prediction error is decomposed further into a Gaussian 
pyramid according to (5). Because the receiver needs only to 
know about the expand operator, the levels of the pyramid 
can be decomposed by employing different reduce opera- 
tors, which may even be nonlinear. Therefore, the complete 
Gaussian pyramid of the prediction errors D is obtained by 
concatenating the prediction error of the refinement layer and 
the Gaussian pyramid of the base layer 

$$D = \big(d^{(0)}, d^{(1)}, \ldots, d^{(L-1)}\big). \tag{7}$$

From this pyramid, the complete Laplacian pyramid

$$L = \left\{ \big(l^{(0)}, l^{(1)}, \ldots, l^{(L-1)}\big) \;\middle|\; l^{(k)} = d^{(k)} - E\big(d^{(k+1)}\big),\ k = 0, \ldots, L-2;\ l^{(L-1)} = d^{(L-1)} \right\} \tag{8}$$

is calculated, which is sufficient for reconstruction of the
DFD at the refinement layer decoder. The base layer decoder
needs only to decode the truncated Laplacian pyramid $L'$.
The Laplacian pyramid is quantized with a layered quanti- 
zation scheme. The coefficients are selected and quantized 
according to their amplitude in decreasing order. Due to the 
energy concentration, mainly coefficients from higher levels 
are coded first. The resulting symbol streams are coded using 
conditioning contexts and adaptive arithmetic coding [8]. Since 
the context choice depends only on the current or higher levels, 
scalability is retained. A detailed description of the coding 
scheme and the filters is given in Section V. 
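
The idea of a layered quantizer whose output can be truncated after any pass may be illustrated as follows. This is a generic successive-approximation sketch with halving thresholds, assumed here purely for illustration; it is not the quantizer of Section V, it omits the context-based arithmetic coding entirely, and the function name is hypothetical.

```python
import numpy as np

def layered_quantize(coeffs, n_passes):
    """Successive-approximation quantization (illustrative stand-in).

    Returns a list of (threshold, reconstruction) pairs, one per pass.
    Assumes at least one coefficient is nonzero.
    """
    coeffs = np.asarray(coeffs, dtype=float)
    # Largest power of two not exceeding the largest magnitude.
    t = 2.0 ** np.floor(np.log2(np.max(np.abs(coeffs))))
    recon = np.zeros_like(coeffs)
    passes = []
    for _ in range(n_passes):
        residual = coeffs - recon
        significant = np.abs(residual) >= t
        # Refine only coefficients whose remaining error reaches the threshold,
        # so large amplitudes are served first.
        recon = recon + np.where(significant, np.sign(residual) * t, 0.0)
        passes.append((t, recon.copy()))
        t /= 2.0
    return passes

coeffs = np.array([10.0, -3.5, 0.7, 6.2])
passes = layered_quantize(coeffs, n_passes=4)
max_errors = [float(np.max(np.abs(coeffs - r))) for _, r in passes]
```

Stopping after any pass yields a valid, coarser reconstruction whose maximum error stays below that pass's threshold, which is the property an embedded bit stream relies on.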

The open-loop approach, where the Laplacian pyramid 
levels are quantized independently, simplifies layered quan- 
tization, and allows for embedded coding. Furthermore, as is 
shown in (41) in Appendix A, the feedback of the quantization 
errors of the Laplacian pyramid within the DPCM loop does 
not degrade the performance since the current level $l_n^{(k)}$ is
affected only by the quantization noise of the corresponding
level of the previous frame $l_{n-1}^{(k)}$.

The Laplacian pyramid coding principle is adopted for 
coding of the motion vector fields. The main differences are 
that for both vector field components independent Laplacian 
pyramids are calculated, and the pyramids are coded losslessly. 
The vector field of the base layer is decomposed into Laplacian 
pyramids, and to obtain the complete pyramid, the motion 
vector differences of the refinement layer are added as level 
0. Since the vector field components have finite precision, 
quantization is replaced by bit plane coding. This topic is 
elaborated in Section VI. 
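
Bit plane decomposition of finite-precision vector components can be sketched as below. The sign/magnitude split and the function names are assumptions made for illustration; the actual scheme of Section VI may organize the planes differently.

```python
import numpy as np

def to_bit_planes(values, n_bits):
    """Split signed integers into a sign plane and magnitude bit planes (MSB first)."""
    values = np.asarray(values, dtype=np.int64)
    sign = values < 0
    mag = np.abs(values)
    planes = [(mag >> b) & 1 for b in range(n_bits - 1, -1, -1)]
    return sign, planes

def from_bit_planes(sign, planes):
    """Rebuild the integers from the sign plane and all magnitude planes."""
    mag = np.zeros_like(planes[0])
    for p in planes:
        mag = (mag << 1) | p
    return np.where(sign, -mag, mag)

# Hypothetical 2x2 block of one motion vector component (integer units).
v = np.array([[3, -2], [0, 7]])
sign, planes = to_bit_planes(v, n_bits=3)
v_decoded = from_bit_planes(sign, planes)
```

Because every plane is transmitted, the representation is lossless, which is the requirement stated for the vector fields.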

III. Motion Estimation and Compensation

The motion model used for this coder concept assumes that
the motion of the 3-D objects can be described completely by
displacements $v_n(x)$ on the image plane. Denoting the location
of the pixels on the image grid by $x = (x, y)^T$, frame $g_n$ is
related to $g_{n-1}$ by

$$g_n\big(x + v_n(x)\big) = g_{n-1}(x)$$

neglecting occlusions, uncovered background, and global illumination
changes. The displacement vector field for frame $g_n$
is the set of all displacement vectors $v_n$.

The aim of motion estimation in the context of coding is
to find displacements $v_n(x)$ with respect to the previously
decoded frame $\tilde{g}_{n-1}$

$$v_n = \mathrm{ME}\big(g_n, \tilde{g}_{n-1}\big)$$

such that the prediction error

$$e_{\mathrm{DFD}} = \sum_x \big\| g_n(x) - \hat{g}_n(x) \big\| \tag{9}$$

becomes minimal according to a norm. The prediction signal
$\hat{g}_n$ is obtained typically by displacing pixels of the previous
frame, which may be weighted additionally (e.g., overlap block
motion compensation) using a weighting function $w_i$:

$$\hat{g}_n(x) = \sum_i w_i(x) \cdot \tilde{g}_{n-1}\big(x - v_n(x_i)\big) = \mathrm{MC}\big(\tilde{g}_{n-1}, v_n\big). \tag{10}$$

A. Optimal Motion Compensation for the Base Layer 

For the moment, we assume that the original previous frame 
is used for "motion compensation." On the base layer, motion 
compensation is performed on low-pass filtered and downsam- 
pled frames, which limits the prediction performance even if 
the motion is known. One reason is that subsampling causes 
aliasing errors. Another one is that motion compensation and 
low-pass filtering cannot be interchanged in general 

$$R\big(\mathrm{MC}\big(g_{n-1}^{(0)}, v_n^{(0)}\big)\big) \neq \mathrm{MC}\big(R\big(g_{n-1}^{(0)}\big), v_n^{(1)}\big) \tag{11}$$

because the low-pass filter $R(\cdot)$ weights and sums up spatially
neighbored pixels. Thus, the filtered pixels are different if 
the original pixels are displaced prior to filtering [13]. An 
exception is the case of constant displacements v(x) = const., 
where all pixels undergo the same displacement. Low-pass 
filtering and motion compensation are then interchangeable, 
and equality holds in (11). 
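
A small numerical experiment makes the two cases concrete. It is a sketch under simplifying assumptions: the reduce operator is a 2x2 block mean, motion compensation is a per-row circular shift, and the helper names are hypothetical.

```python
import numpy as np

def reduce_(img):
    """Stand-in reduce operator R: 2x2 block mean, then subsample by 2."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def mc_rows(frame, dx_per_row):
    """Toy motion compensation: shift each row horizontally by its own integer dx."""
    return np.stack([np.roll(row, dx) for row, dx in zip(frame, dx_per_row)])

rng = np.random.default_rng(1)
frame = rng.standard_normal((8, 8))

# Constant displacement: 2 pixels at level 0 corresponds to 1 pixel at level 1.
lhs = reduce_(mc_rows(frame, [2] * 8))
rhs = mc_rows(reduce_(frame), [1] * 4)
commutes_constant = bool(np.allclose(lhs, rhs))

# Varying displacement: the motion boundary (between rows 4 and 5) cuts through
# a 2x2 block, so no integer level-1 vector for that block row reproduces
# the reduced level-0 prediction.
v0 = [2, 2, 2, 2, 2, 0, 0, 0]
lhs = reduce_(mc_rows(frame, v0))
commutes_varying = [
    bool(np.allclose(lhs, mc_rows(reduce_(frame), v1)))
    for v1 in ([1, 1, 1, 0], [1, 1, 0, 0])
]
```

With a constant field the two orders of operation agree; with the varying field both candidate reduced fields fail, because the block mean mixes pixels that moved differently.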

A deeper analysis reveals that motion can be interpreted as 
a shearing of the image spectrum into a hyperplane along the 
$\omega_t$ axis in the 3-D frequency space, and motion compensation
reverts the shearing [14]. Therefore, low-pass filtering prior
to motion compensation causes degradations since the hyper- 
plane rather than the plane of the original image spectrum 
is multiplied by the filter transfer function. This interpretation 
also reveals why motion compensation prior to filtering results 
in more accurate predictions than compensation after filtering. 

According to our motion model, where the moving scene 
content results in displacements of pixels in the image plane of 
the camera, the next frame is generated by displacing the pixels 
of the previous frame followed by filtering and subsampling. 
Therefore, for motion compensation of the base layer using a
given motion vector field $v_n^{(0)}$, the low-pass filtered prediction
from the highest available resolution level results in a higher
prediction gain than calculating the prediction on the lower
resolution level [13]:

$$\hat{g}_{n,\mathrm{opt}}^{(k)} = R^k\big(\mathrm{MC}\big(g_{n-1}^{(0)}, v_n^{(0)}\big)\big), \qquad k = 0, 1. \tag{12}$$

This type of prediction is termed in the following optimum
prediction. The superscript $k$ on $R$ denotes that the reduce function
is applied $k$ times. In principle, (12) holds also if the
reconstructed previous frame is used for motion compensation.
Hence, the Gaussian-type pyramid $D_n$ formed by the displaced
frame differences

$$d_n^{(k)} = g_n^{(k)} - \hat{g}_{n,\mathrm{opt}}^{(k)}$$

has minimal energy. Furthermore, due to the linearity of the
reduce operator

$$d_n^{(k)} = R^k\big(d_n^{(0)}\big)$$

holds for each level of $D_n$. Thus, $D_n$ is identical to the
pyramid obtained by decomposing the prediction error of level
$k = 0$. It is shown in [8] that such a pyramid can be coded
efficiently.

B. Scalable Motion Compensation 

This is the crucial part in a spatially scalable coding scheme 
since the prediction for the base layer needs to be calculated 
without considering the refinement level. The design aim with 
respect to (2) is to achieve equality of 

$R(\mathrm{MC}(g^{(0)}_{n-1}, v^{(0)}_n)) \approx \mathrm{MC}(R(g^{(0)}_{n-1}), v^{(1)}_n)$ (13) 

with $v^{(1)}_n$ being a suitably reduced version of $v^{(0)}_n$. Due to 
(11), an optimal prediction for the base layer $\hat g^{(1)}_{n,\mathrm{opt}}$ cannot 
be obtained from $g^{(1)}_{n-1}$ in general. Therefore, the aim is to 
approximate $\hat g^{(1)}_{n,\mathrm{opt}}$ as closely as possible which, with respect 
to (12), is equivalent to approximating the refinement layer 
$g^{(0)}_{n-1}$ as accurately as possible. The idea is to interpolate the 
base layer frame $g^{(1)}_{n-1}$ such that the distance to the refinement 
layer $g^{(0)}_{n-1}$ is minimized in the least squares sense: 

$\| g^{(0)}_{n-1} - E(g^{(1)}_{n-1}) \|^2 \to \min.$ (14) 

In real coders, only the coded previous frame $\tilde g^{(1)}_{n-1} = g^{(1)}_{n-1} + 
q^{(1)}_{n-1}$ is available, where $q^{(1)}_{n-1}$ denotes the quantization error (Appendix A). 
However, (14) remains valid since the quantization 
error of level 0 is fixed 

o Co) -E(a {l) \-o {q) \±n { - 0) 

Motion compensation is performed on the expanded frames, 
and the final prediction is obtained by reducing the compen- 
sated, expanded frame 

$\hat g^{(k)}_n = R(\mathrm{MC}(E(\tilde g^{(k)}_{n-1}), v^{(k)}_n)), \quad k = 0, 1.$ (15) 

Since the equation can be interpreted for the base layer as 
half-pel motion compensation (assuming integer-valued $v^{(1)}_n$), 
the same technique is also applicable for the refinement layer 
k = 0. The vector fields $v^{(k)}_n$ are obtained by motion estimation 
on the expanded frames 

$v^{(k)}_n = \mathrm{ME}(E(g^{(k)}_n), E(\tilde g^{(k)}_{n-1})), \quad k = 0, 1.$ (16) 

Note that (15) depends on no particular method for motion 
estimation and compensation. 

To achieve a good approximation of $\hat g^{(1)}_{n,\mathrm{opt}}$, the operator 
pair E(·), R(·) needs to meet further constraints. The chain of 
operators must equal the identity operator 

R(E(x)) = x. 

This characteristic ensures that filtering effects do not distort 
the prediction. Furthermore, for the special case of global 
constant translational motion, the optimum prediction can be 
obtained. 
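
The identity constraint can be illustrated with the simplest member of the spline family mentioned below, the order-0 (Haar) pair. This is only a sketch under that assumption: the paper's centered third-order spline filters are not reproduced here, and the names `reduce_op` and `expand_op` are hypothetical.

```python
import numpy as np

def expand_op(x):
    """Order-0 (Haar) expand: repeat every sample over a 2 x 2 block."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def reduce_op(x):
    """Order-0 (Haar) reduce: average every 2 x 2 block."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(0)
coarse = rng.standard_normal((4, 6))   # stands in for a base layer frame
fine = rng.standard_normal((8, 12))    # stands in for a refinement layer frame

# Expand followed by reduce returns the coarse frame unchanged.
identity_holds = np.allclose(reduce_op(expand_op(coarse)), coarse)

# Reduce followed by expand is only an approximation of the fine frame,
# but applying it twice changes nothing further (it acts as a projection).
approx = expand_op(reduce_op(fine))
idempotent = np.allclose(expand_op(reduce_op(approx)), approx)
```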

Additionally, the filters must account for the fact that motion 
is a local property. Hence, the filters should be spatially 
localized or, in other words, the impulse response of the filters 
should have compact support. 

The required properties are provided by centered third-order 
spline filters, a modified version of the cubic spline filters 
described in [15]. The class of spline filters allows for a 
tradeoff between spatial localization and aliasing since they 
include the Haar filter as a filter of order 0 as well as the sinc 
interpolator for infinite order. The distinctive feature of the 
proposed filters is the centering of the lower resolution grid 
with respect to the higher resolution grid. The advantage for 
scalable motion compensation is that the centered spline filters 
do not introduce a half-pel shift, and displacements remain at 
the same location in different levels. In Section IV, the filter 
design is linked to the Laplacian pyramids, and a detailed 
description is given in Appendix B. 

C. Implementation Aspects 

Since in a scalable video coding scheme the refinement level 
is not available to the base layer encoder, motion estimation 
is performed top down starting at the base layer. The method 
for motion estimation and compensation can be chosen freely 
in principle. However, an approach based on block matching 
is simple and robust. Another advantage of a block-oriented 
scheme is that low-pass filtering and motion compensation are 
interchangeable for v(x) = const. (11). Therefore, locally using 
a translational motion model is advisable. 

To obtain a smooth vector field, a gain-cost criterion is 
evaluated as distance measure for each block j [16] 

$v_n(j) = \arg\min_{v \in V} \Big\{ \log(e_j^2(v)) + \lambda \sum_{i \in \mathcal{N}_j} \| v - v_n(i) \| \Big\}, \quad \lambda = \mathrm{const.}$ (17) 

with $\mathcal{N}_j$ denoting the set of neighbored blocks of block 
j. The calculation of the prediction error $e_j$ for block j 
according to (9) employs overlapping blocks; hence, (17) 
includes overlap block motion estimation. Furthermore, motion 
compensation causes no blocking artifacts. The factor $\lambda$ 
controls the smoothness by penalizing large vector differences. 
Due to the interaction with neighbored blocks, the vector field 



needs to be calculated iteratively. However, the computational 
load is still less than for full search block matching since the 
test vector set V is very small (18 candidates), and only about 
four iterations are sufficient. Since estimating motion at the 
expanded frames (16) can be interpreted as half-pel motion 
estimation, no additional half-pel motion estimation procedure 
like bilinear interpolation is needed. 
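
One sweep of the iterative evaluation of (17) can be sketched as follows. The sketch assumes the squared prediction errors for every block and candidate vector have already been computed; the function name `gain_cost_update`, the data layout, and the toy numbers are illustrative assumptions, not the paper's implementation (which uses 18 candidates and overlapping blocks).

```python
import numpy as np

def gain_cost_update(errors, candidates, field, neighbors, lam):
    """One sweep over all blocks, re-selecting each vector by the gain-cost criterion.

    errors[j, c] : squared prediction error of block j for candidate c
    candidates   : (C, 2) array of test vectors
    field        : (J, 2) array holding the current vector of every block
    neighbors[j] : indices of the blocks adjacent to block j
    lam          : smoothness weight (lambda in (17))
    """
    new_field = field.copy()
    for j in range(len(field)):
        smooth = np.zeros(len(candidates))
        for i in neighbors[j]:
            # Distance of every candidate to the neighbor's current vector.
            smooth += np.linalg.norm(candidates - new_field[i], axis=1)
        cost = np.log(errors[j]) + lam * smooth
        new_field[j] = candidates[np.argmin(cost)]
    return new_field

# Toy example: three blocks in a row, two candidate vectors.
candidates = np.array([[0, 0], [1, 0]])
errors = np.array([[10.0, 1.0],    # block 0 clearly prefers (1, 0)
                   [1.0, 1.2],     # block 1 marginally prefers (0, 0)
                   [10.0, 1.0]])   # block 2 clearly prefers (1, 0)
neighbors = [[1], [0, 2], [1]]
initial = candidates[np.argmin(errors, axis=1)]

unsmoothed = gain_cost_update(errors, candidates, initial, neighbors, lam=0.0)
smoothed = gain_cost_update(errors, candidates, initial, neighbors, lam=1.0)
```

With the penalty switched on, the marginal outlier in the middle block is pulled to the vector of its neighbors, which is the smoothing effect the criterion is meant to have.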

At the refinement layer, the same gain-cost constrained 
block matching scheme applies. The vector field from the 
base layer is taken into consideration. Due to the constraint of 
(2), the predictions must be consistent across scales. Hence, 
the size of the blocks is increased by a factor of 2. A block size 
of 16 x 16 at level 0 corresponds to a block size of $16 \cdot 2^{-k} \times 
16 \cdot 2^{-k}$ at level $k$. As a consequence, the vector field sizes 
of the scalable resolution levels are equal. The vector field 
$E_v(v^{(k)}_n)$ serves as an initial estimate for motion estimation, 
calculated by an appropriate scaling function $E_v$. Since the 
number of vectors at both levels is the same, $E_v$ scales only the 
amplitudes of the vector components by 2. Regarding $v^{(k)}_n$ as 
average motion, the motion vector field $v^{(k-1)}_n$ can be written 
as 

$v^{(k-1)}_n = E_v(v^{(k)}_n) + \Delta v^{(k-1)}_n$ (18) 

where $\Delta v^{(k-1)}_n$ denotes the refinement. The gain-cost criterion 
constrains the motion estimate to smooth vector fields elimi- 
nating spurious vectors. Finally, a hierarchy of motion vector 
fields has been obtained, and needs to be coded (Section VI). 

IV. Choice of Reduce and Expand Operators 

The choice of the reduce operator R(-) and the expand 
operator E(-) is the key issue to achieve efficient multiresolu- 
tion motion estimation and compensation, as well as efficient 
coding of the Laplacian pyramid of the displaced frame 
differences. 

In the previous section, we compiled operator conditions for 
multiresolution motion compensation. First, the interpolation 
of the base layer should approximate the refinement layer in 
the least squares sense. Furthermore, the operator chain E(·) 
followed by R(·) should be the identity operator. 

For DFD coding, the main objective for the filter design 
with respect to the Laplacian pyramid is to shift the signal 
energy into the higher pyramid levels. Concentration of the 
energy in only few pixels results in the desired coding gain. 
This is equivalent to the objective of minimizing the energy of 
the lower levels of the Laplacian pyramid. A useful criterion 
is that the approximation operator [R(·) followed by E(·)] 
should be optimal in the least squares sense. 

Comparing both sets of conditions, it turns out that they are 
identical for both tasks. Hence, one set of filters suffices. We 
chose spline filters, and found that centered third-order filters 
performed best. In Appendix B, the filters are described in 
detail, and it is proven that they provide the desired properties. 

V. DFD Pyramid Coding 

The two-layer scalable coding scheme under consideration 
outputs two displaced frame differences $d^{(0)}, d^{(1)}$ for each 
coded frame. The base layer DFD $d^{(1)}$ is decomposed into 
a Laplacian pyramid. As explained in Section II, a Laplacian 
level $l^{(0)}$ is derived from the refinement layer DFD $d^{(0)}$, and 
this level, together with the truncated pyramid L', forms the 
complete Laplacian pyramid of both DFD's. 

Fig. 5. Least squares Laplacian pyramid (LSLP) decomposition of DFD 11 of sequence silent. For display, the amplitudes have been multiplied by two for 
the Gaussian and by four for the Laplacian pyramid. (a) Gaussian DFD pyramid D. (b) Laplacian DFD pyramid L. 

This results in energy concentration into the higher (low- 
pass) levels of the pyramid, which consist of considerably 
fewer pixels than the original image. Usually, the energy is 
not equally distributed within the levels. This is due to the 
nature of motion compensation: some areas in the original 
sequence may contain complex motion where the motion 
compensation fails, while other areas contain no or only simple 
translational motion. The DFD energy will be the highest 
in areas where motion compensation failed. The severely 
limited bit budget for DFD coding in low bit-rate coders only 
allows encoding of selected areas of the DFD. Most DFD 
coding schemes encode the address information as overhead 
(addressing the areas, which are actually coded) separately 
from the quantized data. We use instead a layered coding 
method, where address information and data are encoded 
jointly in significance maps. 

The coding gain which can be achieved with the Laplacian 
pyramid depends strongly on the employed filters. We have 
chosen a centered version of the cubic spline least squares 
Laplacian pyramid (LSLP) because this set of filters performed 
best of all investigated linear filters (see Appendix B). 

In Fig. 5, the LSLP decomposition is shown for the frame 
difference between frame 11 and the motion-compensated 
prediction for this frame obtained from the reconstructed frame 
6 of the image sequence silent. The lowest level of the 
pyramid consists of 352 x 288 pixels (common intermediate 
format, CIF). In Fig. 5(a), the Gaussian DFD pyramid D is 
given; in Fig. 5(b), the corresponding Laplacian pyramid L 
is given. The top three images in each subfigure constitute 
the truncated pyramids D' and L' (base layer DFD and 
Laplacian pyramid). The large images $d^{(0)}$ and $l^{(0)}$ belong 
to the refinement layers. 

One can see that the base layer decomposition of $d^{(1)}$ 
into the truncated pyramid L' yields energy compaction into 
the higher pyramid levels. The design objective $R(d^{(0)}) \approx d^{(1)}$ 
does not hold with equality; therefore, $l^{(0)}$ contains 
more energy than the other levels of the Laplacian pyramid. 
Especially where strong motion appears (the raising hand), 
$d^{(0)}$ cannot be predicted well from $d^{(1)}$. On the other hand, 
$l^{(0)}$ contains much less energy than $d^{(0)}$, which indicates 
that predicting $d^{(0)}$ from $d^{(1)}$ is better than no prediction 
at all. Doing without prediction leads to the simulcast case 
(independent coding of base and refinement layer), which is 
used for comparison in Section VII. 



A. Layered Quantization 

Let $l^{l}_{i,j}$ denote the pixel value at location $(i, j)$ of level $l$ 
of the pyramid L. The range of the indexes is $i \in 0, \ldots, I/2^l - 1$, 
$j \in 0, \ldots, J/2^l - 1$, and $l \in 0, \ldots, L - 1$, where I and J are the 
dimensions of the original frame. The pyramid is scanned top 
down (starting with $l = L - 1$ down to $l = 0$), resulting in a 
sequence of pyramid values x[i]. The scan order inside each 
level is a simple line scan. 

Scalability is achieved by progressive quantization in a 
sequence of up to N layers. Each layer can be described by the 
associated quantizer. It is convenient to differentiate between 
two sets of quantizers $Q^D_0, \ldots, Q^D_{N-1}$ and $Q^S_1, \ldots, Q^S_{N-1}$. 
The first set of quantizers belongs to the dominant passes, and 
the second to the subordinate passes of the coding algorithm. 
These passes alternate, starting with a dominant pass. Hence, 
during the coding process, the sequence $Q^D_0, Q^S_1, Q^D_1, Q^S_2, 
Q^D_2, \ldots$ of quantizers is employed. 

Each quantizer $Q_n$ is defined by a set of disjoint quantization 
intervals $I_{n,k_1}, I_{n,k_2}, \ldots$ and the quantization function 

$Q^D_n(x) = k, \quad \text{for } x \in I^D_{n,k}$ (19) 
$Q^S_n(x) = k, \quad \text{for } x \in I^S_{n,k}$ (20) 

which maps each input value x to an index k. 

In order to allow efficient layered quantization, the sequence 
of quantization intervals must form a set of nested intervals. 

The "dominant" intervals $I^D_{n,k}$ are symmetric around zero, 
and are uniformly spaced with a dead zone twice as large as 
all of the other quantization intervals. From layer to layer, 
each quantization interval is halved. Thus, the intervals are 
specified with a single parameter $\Delta_0$ as follows: 

$I^D_{n,k} = \begin{cases} (-\Delta_n, \Delta_n), & \text{if } k = 0 \\ [k\Delta_n, (k+1)\Delta_n), & \text{if } k > 0 \\ ((k-1)\Delta_n, k\Delta_n], & \text{if } k < 0 \end{cases}$ (21) 

with 

$\Delta_n = \Delta_{n-1}/2, \quad \text{for } n > 0.$ (22) 

The "subordinate" intervals $I^S_{n,k}$ are refinements of the 
intervals $I^D_{n-1,k}$. More precisely, each interval $I^D_{n-1,k}$ contains 
two intervals $I^S_{n,k'}$ of equal width, except for k = 0, 
where $I^D_{n-1,0} = I^S_{n,0}$. Hence, the "subordinate" intervals are 
symmetric around zero, uniformly spaced with a dead zone 
four times larger than the other intervals 

$I^S_{n,k} = \begin{cases} I^D_{n-1,0}, & \text{if } k = 0 \\ \emptyset, & \text{if } k \in \{-1, 1\} \\ I^D_{n,k}, & \text{if } k \notin \{-1, 0, 1\}. \end{cases}$ (23) 

Note that each quantization interval of $Q^D_n$ is contained in 
some quantization interval of $Q^D_{n-1}$. Moreover, the intervals 
$I^D_{n,0}$ contain the three intervals $I^D_{n+1,-1}$, $I^D_{n+1,0}$, $I^D_{n+1,1}$, 
whereas all other intervals $I^D_{n,k}$ (with $k \neq 0$) contain two 
intervals $I^D_{n+1,2k}$ and $I^D_{n+1,2k \pm 1}$ (the sign following that of $k$). 
(Similar properties exist for the 
"subordinate" intervals since these are defined through (23) 
via the dominant intervals.) This is the only condition on 
the quantization intervals which must be imposed. However, 
specifying the intervals as in (21) is very convenient because 
the set of quantization functions is defined by $\Delta_0$ and N alone. 
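
The interval definitions (21)-(23) translate into two small index functions. The sketch below is an illustration under the reading of the equations given above; the function names and the parameter `delta0` (the initial step size) are chosen here and are not taken from the paper.

```python
import math

def dominant_index(x, n, delta0):
    """Index k of the dominant interval of layer n that contains x, following (21)."""
    delta_n = delta0 / 2 ** n                 # (22): the step is halved per layer
    if abs(x) < delta_n:
        return 0                              # dead zone (-delta_n, delta_n)
    if x > 0:
        return math.floor(x / delta_n)        # [k*delta_n, (k+1)*delta_n)
    return -math.floor(-x / delta_n)          # ((k-1)*delta_n, k*delta_n]

def subordinate_index(x, n, delta0):
    """Index k of the subordinate interval of layer n >= 1, following (23)."""
    if abs(x) < delta0 / 2 ** (n - 1):
        return 0                              # dead zone of the previous dominant layer
    return dominant_index(x, n, delta0)       # same intervals as the dominant set

def parent_index(k):
    """Index of the layer-n interval that contains interval k of layer n+1."""
    return int(k / 2)                         # truncation toward zero
```

The nesting property means that `parent_index(dominant_index(x, n + 1, delta0))` equals `dominant_index(x, n, delta0)` for every input, so each finer index only adds information to the coarser one.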

B. Symbol Stream Generation 

We define two sets of sequences $\sigma'_n[i]$ and $\delta'_n[i]$ as follows: 

$\sigma'_n[i] = Q^D_n(x[i])$ (24) 
$\delta'_n[i] = Q^S_n(x[i]) \pmod 2.$ (25) 

Since the quantization intervals form a set of nested intervals, 
all information necessary to recover the current quantization 
intervals is contained in these sequences. Moreover, if 
these sequences are ordered in the same way as the quantizers, 
i.e., $\sigma'_0[i], \delta'_1[i], \sigma'_1[i], \ldots$, a part of this sequence is still 
redundant. Particularly, 1) every $\delta'_n[i]$ for which $\sigma'_{n-1}[i] = 0$, 
and 2) every value $\sigma'_n[i] \notin \{-1, 0, 1\}$ can be predicted 
from previous parts of the alternating sequence $\sigma'_0[i], \delta'_1[i], 
\sigma'_1[i], \ldots$. 

Thus, the variable length strings $\sigma_0, \delta_1, \sigma_1, \ldots$ contain the 
same information as the strings $\sigma'_0[i], \delta'_1[i], \sigma'_1[i], \ldots$, if we 
denote by $\sigma_n$ and $\delta_n$ the strings obtained by removal of the 
redundant entries of $\sigma'_n[i]$ and $\delta'_n[i]$, respectively. 

Removal of the redundant entries works the same way as in 
Shapiro's EZW coder [6]. In the first dominant pass, all entries 
are scanned; thus, $\sigma'_0$ equals $\sigma_0$. The first subordinate pass 
deals only with the samples which have become significant in 
the first dominant pass. In accordance with condition 1), those 
values of $\delta'_1[i]$, for which $\sigma'_0[i]$ has been zero (insignificant), 
are omitted in $\delta_1$. Condition 2) means that all samples which 
have become significant during previous dominant passes are 
not regarded in any subsequent dominant pass. 

The symbol stream consisting of the concatenation of the 
strings $\sigma_0, \delta_1, \sigma_1, \ldots$ is encoded arithmetically, which is described 
in the next subsection. 
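
The generation of the pruned strings and their inversion at the decoder can be sketched as follows. This is an illustrative reading of (24), (25) and conditions 1) and 2), not the authors' code; it assumes $|x| < 2\Delta_0$ for all samples so that the dominant strings stay three-valued, and the names `encode`, `decode`, and `delta0` are hypothetical.

```python
import math

def dominant_index(x, n, delta0):
    """Dominant quantizer of layer n as in (21), with step delta0 / 2**n."""
    delta_n = delta0 / 2 ** n
    if abs(x) < delta_n:
        return 0
    k = math.floor(abs(x) / delta_n)
    return k if x > 0 else -k

def encode(values, num_layers, delta0):
    """Return the pruned strings [sigma_0, delta_1, sigma_1, delta_2, ...]."""
    strings = []
    prev = [0] * len(values)          # nothing is significant before layer 0
    for n in range(num_layers):
        cur = [dominant_index(x, n, delta0) for x in values]
        if n > 0:
            # Subordinate pass: one refinement bit per already significant sample.
            strings.append([k % 2 for k, p in zip(cur, prev) if p != 0])
        # Dominant pass: only samples that were insignificant so far.
        strings.append([k for k, p in zip(cur, prev) if p == 0])
        prev = cur
    return strings

def decode(strings, count, num_layers):
    """Rebuild the dominant indices of the last layer from the pruned strings."""
    it = iter(strings)
    idx = [0] * count
    for n in range(num_layers):
        significant = [p != 0 for p in idx]
        if n > 0:
            bits = iter(next(it))
            # A significant index splits into two children; the bit selects one.
            idx = [2 * p + (1 if p > 0 else -1) * next(bits) if s else 0
                   for p, s in zip(idx, significant)]
        symbols = iter(next(it))
        idx = [p if s else next(symbols) for p, s in zip(idx, significant)]
    return idx

values = [0.5, -3.0, 7.2, 12.9, -15.0, 1.0]
strings = encode(values, num_layers=4, delta0=8.0)
decoded = decode(strings, len(values), num_layers=4)
```

Truncating the list of strings after any complete pass and decoding with the correspondingly smaller number of layers yields the coarser indices, which is what makes the representation embedded.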

C. Conditional Arithmetic Coding 

The strings $\delta_n$ are binary, and show only very little 
correlation between the letters. Thus, they are encoded with an 
adaptive arithmetic coder without regarding any context. 

The three-valued strings $\sigma_n$ are encoded with an arithmetic 
coder conditioned on context [17], which is collected from 
previously reconstructed values of $\sigma'_n[i]$. For this task, a 
conditioning sequence $\kappa_n[i]$ is constructed from $\sigma'_n[i]$ by 
thresholding 

$\kappa_n[i] = \begin{cases} -1, & \text{if } \sigma'_n[i] < 0 \\ 0, & \text{if } \sigma'_n[i] = 0 \\ +1, & \text{if } \sigma'_n[i] > 0. \end{cases}$ (26) 

Denote by (j, k) the coordinates and by $l$ the pyramid level 
of the pixel belonging to the index i. For convenience, we 
set $\kappa^l_n(j, k) = \kappa_n[i]$. Then the values of $\kappa^l_n(j-1, k)$ and 
$\kappa^l_n(j, k-1)$ are used to build a local conditioning context. 
Additionally, the significance information of $\kappa^{l+1}_n(j/2, k/2)$ 
(i.e., $|\kappa^{l+1}_n(j/2, k/2)|$) is used to utilize correlations of 
significance across scales. The implementation of the arithmetic 
coder follows the paper of Witten et al. [18]. 
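
A context built from the two causal neighbors and the parent significance can be mapped to a single integer that selects the probability model of the arithmetic coder. The sketch below shows one possible mapping with 3 x 3 x 2 = 18 states; the index layout, the treatment of positions outside the frame (taken as 0), and the name `context_index` are assumptions made for illustration.

```python
def context_index(level, parent, j, k):
    """Map the conditioning symbols around position (j, k) to a context number.

    level  : 2-D list of conditioning symbols (-1, 0, +1) of the current level
    parent : 2-D list of conditioning symbols of the next higher (coarser) level
    """
    first = level[j - 1][k] if j > 0 else 0     # kappa(j - 1, k)
    second = level[j][k - 1] if k > 0 else 0    # kappa(j, k - 1)
    parent_sig = abs(parent[j // 2][k // 2])    # significance across scales
    return ((first + 1) * 3 + (second + 1)) * 2 + parent_sig

level = [[1, 0, -1, 0],
         [0, 0, 1, 0],
         [-1, 1, 0, 0],
         [0, 0, 0, 1]]
parent = [[1, 0],
          [0, -1]]

contexts = [[context_index(level, parent, j, k) for k in range(4)] for j in range(4)]
```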

D. Inverse Quantization at the Decoder 

Until now, the pyramid encoder has been described. The 
decoder receives an arithmetically encoded bit stream, decodes 
it into a stream of symbols, and reconstructs from this symbol 
stream successively refined quantization intervals. For each 
coefficient, a value of the current interval must be chosen as the 
reconstruction value (inverse quantization). The reconstructed 
Laplacian levels are suitably expanded and summed up, giving 
the base and refinement layer DFD's $d^{(1)}$ and $d^{(0)}$. 

There are several possibilities for designing the inverse 
quantization function. In Shapiro's original EZW coder, the 
reconstruction values are chosen in the middle of the current 
interval. Since our pyramid decomposition is overcomplete, 
there is some redundancy in the unquantized pyramid. This 
means that not all possible combinations are consistent with 
the employed reduce and expand operators. We utilize this 
redundancy in the pyramid for quantization error reduction. 

Theorem 1: Denote by $l^{(k)}_Q$ the $k$th Laplacian level with 
independently quantized values. Then the quantized Laplacian 
levels 

$\tilde l^{(k)}_Q = l^{(k)}_Q - E(R(l^{(k)}_Q)), \quad k = 1, \ldots, L-2$ (27) 

are, in general, closer (in the mean-squared sense) to the 
unquantized levels than the levels $l^{(k)}_Q$. 

Proof: For least squares pyramids, the concatenation of 
reduce and expand operators yields the identity operator or an 
orthogonal projection operator P (depending on the order) 

$R(E(x)) = x$ 

and 

$P(x) := E(R(x)) = P(P(x)), \quad \forall x.$ (28) 

These properties are derived in Appendix B. 

All "true" Laplacian levels (i.e., the levels $l^{(k)}$, $k = 
1, \ldots, L-2$) have the form $l^{(k)} = d^{(k)} - E(R(d^{(k)})) = 
d^{(k)} - P(d^{(k)})$. Keeping this in mind, it is easy to see that the 
projection of an unquantized Laplacian level is zero 

$P(l) = E(R(l)) = 0.$ (29) 

Denote by $l_Q$ a Laplacian level with independently quantized 
values. We now decompose these levels into a sum of 
the true Laplacian level and a quantization error 

$l_Q = l + q.$ (30) 

Application of the projection operator P on both sides leads to 

$P(l_Q) = P(q)$ (31) 

and 

$l_Q - P(l_Q) = l_Q - P(q) = l + q - P(q).$ (32) 

Thus, application of (27) projects the quantization error 
q linearly into the null space of P. Hence, for the new 
quantization error $q - P(q)$, we have $\|q - P(q)\| \le \|q\|$ with 
equality if and only if q belongs to the null space of P. □ 

To utilize the above-mentioned redundancy of the pyramid 
decomposition, we first choose reconstruction values inde- 
pendently based on the quantization intervals known to the 
decoder. Then we project the quantization error into the vector 
space of admissible Laplacian levels by applying (27). 
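
The effect of the projection can be checked numerically. The following 1-D sketch uses the order-0 (Haar) operator pair, for which E(R(·)) is an orthogonal projection, instead of the paper's centered cubic spline filters; the uniform rounding quantizer and all names are assumptions made for illustration.

```python
import numpy as np

def reduce_op(x):
    """Average neighboring sample pairs (order-0 reduce)."""
    return x.reshape(-1, 2).mean(axis=1)

def expand_op(x):
    """Repeat every sample twice (order-0 expand)."""
    return np.repeat(x, 2)

def project(x):
    """P(x) = E(R(x)), an orthogonal projection for this operator pair."""
    return expand_op(reduce_op(x))

rng = np.random.default_rng(1)
d = 10.0 * rng.standard_normal(256)      # stands in for a Gaussian pyramid level
lap = d - project(d)                     # "true" Laplacian level, so P(lap) = 0

step = 4.0
lap_q = step * np.round(lap / step)      # independently quantized values
lap_corrected = lap_q - project(lap_q)   # correction according to (27)

err_before = np.linalg.norm(lap_q - lap)
err_after = np.linalg.norm(lap_corrected - lap)
```

The corrected level again satisfies P(·) = 0, so it is an admissible Laplacian level, and its distance to the unquantized level does not exceed that of the independently quantized one.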

VI. Displacement Vector Field Coding 

An efficient and elegant coding method is derived by 
adopting the pyramid approach for coding of displaced frame 
differences. Specific aspects are that vector fields need to be 
coded lossless, and each vector consists of two components. 
The concept of conditioning contexts is employed because 



it is much more flexible and even more efficient than zero- 
tree coding [10]. The flexibility is especially important for the 
scalable codec. 

At first, coding of the base layer motion vector field $v^{(1)}$ 
is described. Similar to the coding approach for the prediction 
error, both components $v_x$ and $v_y$ of the motion vector field 
are decomposed separately into a Laplacian pyramid $L_x$ 
and $L_y$ of $L_v - 1$ levels, respectively. For simplicity, only one 
component is mentioned in the following, where no ambiguity 
can occur. The levels of the pyramid are obtained by 

$l^{(k)} = v^{(k)} - E_v(v^{(k+1)}), \quad k = 1, \ldots, L_v - 2$ 
$l^{(L_v-1)} = v^{(L_v-1)}$ (33) 

where 

$v^{(k)} = R_v(v^{(k-1)}), \quad k = 2, \ldots, L_v - 1$ (34) 

denotes the levels of the intermediate Gaussian pyramid. The 
reduce and expand operators $R_v(\cdot)$ and $E_v(\cdot)$ need to be designed 
carefully to obtain optimal compression. On the one hand, 
a high energy concentration into the higher pyramid levels 
should be achieved; on the other hand, one should be able 
to code the remaining residuals at lower levels efficiently. 
The two important properties of the coding technique of 
significance maps are that zero coefficients can be coded 
efficiently, and that one large residual coefficient is more 
expensive to code than two small coefficients. 

The operators $R_v(\cdot)$ and $E_v(\cdot)$ have a 2 x 2 support to 
obtain a quadtree-like pyramid structure. As expand operator 
$E_v(\cdot)$, simple repetition is employed. The nonlinear operator 
$R_v(\cdot)$ used for reduction, termed closest couple, provides the 
best performance in our experiments. To achieve that at least 
one coefficient of the Laplacian pyramid becomes zero, this 
operator outputs one element of a block of 2 x 2 vector 
components. The selection criterion maximizes the number of 
small coefficients by first searching for the coefficient pair 
which has the smallest difference. The coefficient of the pair 
which has the smallest difference to the remaining two values 
is taken as output. Since half-pel motion compensation is used, 
the vector field components are scaled by a factor of 2 and 
treated as integers. Hence, the coefficients of the Laplacian 
pyramid also have integer precision. 
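
The closest couple selection for a single 2 x 2 block can be sketched as follows. The text does not specify how the "difference to the remaining two values" is measured or how ties are broken; the sketch assumes the sum of absolute differences and keeps the first candidate on ties, and the function name is illustrative.

```python
from itertools import combinations

def closest_couple(block):
    """Select one of the four integer vector components of a 2 x 2 block.

    Step 1: find the pair of values with the smallest absolute difference.
    Step 2: of that pair, return the value closer to the remaining two values
            (measured here as the sum of absolute differences, an assumption).
    """
    a, b = min(combinations(range(4), 2),
               key=lambda pair: abs(block[pair[0]] - block[pair[1]]))
    rest = [block[i] for i in range(4) if i not in (a, b)]

    def distance(i):
        return sum(abs(block[i] - r) for r in rest)

    return block[a] if distance(a) <= distance(b) else block[b]

block = [4, 5, 9, 8]                       # components scaled to integers
reduced = closest_couple(block)            # value passed to the next level
residuals = [c - reduced for c in block]   # Laplacian coefficients after repetition
```

Because the output is always one of the four inputs and the expand operator is simple repetition, at least one residual of every block is zero, and the partner of the selected value yields a small residual.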

The vector fields must be coded lossless, and the integer 
precision allows for simple bit plane coding, which could also 
be interpreted as a specialized version of layered "quantiza- 
tion." Instead of calculating a threshold A as for DFD coding, 
the position of the most significant bit is calculated. Halving 
of the threshold is equivalent to switching to the next lower 
bit plane. Significant coefficients are the coefficients with a 
bit set in the currently selected bit plane. Therefore, similar to 
DFD coding, a significance sequence $\sigma_n$ is obtained for each 
bit plane, which is coded employing conditioning contexts 
followed by adaptive arithmetic coding [18]. The formulation 
of the different contexts is based on the conditioning sequence 
$\kappa_n[i]$. 

Due to the quadtree structure, one class of contexts depends 
only on conditioning symbols $\kappa_n[i]$ within a 2 x 2 block. 
To keep the context causal, unavailable conditioning symbols 
$\kappa_n[i]$ are taken from the previous layer. From the 
$3^3$ possible states, the seven most frequently occurring states 
(experimentally determined) are used. Since, in the most 
significant bit plane, only a few coefficients become significant 
while most coefficients become significant later, the set of 
contexts is switched after scanning the second bit plane. In 
the case that all coefficients for a block are insignificant, the 
corresponding conditioning symbol of the next higher level is 
taken into account. 

Finally, a different set of contexts is defined for the top 
level of the pyramid, which has a low-pass characteristic and 
no block structure, in contrast to the band-pass characteristic of 
the lower levels. This set is the same as used for DFD coding, 
except that no information on higher levels is available. 

Regarding the scalable video codec, a hierarchy of the vector 
fields is generated (Section III-C). The vector fields do not 
form a pyramid since $v^{(0)}$ and $v^{(1)}$ have the same number 
of vectors. For calculating the refinement level $l^{(0)}$ of the 
Laplacian pyramid in (33), (18) is directly applicable 

$l^{(0)} = v^{(0)} - E_v(v^{(1)})$ (35) 

replacing the expand function of the pyramid by the scaling function $E_v$ of (18). 
Since the decoder knows the motion vector field resolution 
anyway, the refinement vector field can be reconstructed 
from the Laplacian "pyramid." For coding, the concept of 
conditioning contexts is still well suited because the symbols 
denoting the bit plane (quantization interval) can be calculated 
independently of the higher resolution levels. Moreover, the 
conditioning contexts for entropy coding of the symbol stream 
are designed such that they depend only on elements from the 
same level. 

VII. Simulation Results 

First, the suitability of the motion estimation and compen- 
sation scheme (15) in the context of the proposed scalable 
coder is demonstrated. Frames 6, 11, and 16 of the sequence 
foreman 4 have been coded with a nonscalable codec as well 
as with a scalable codec. The base layer prediction error has 
been decomposed into a Gaussian pyramid of four levels. 
Correspondingly, in the nonscalable coder, the DFD has been 
decomposed into five levels. For both codecs, the highest 
spatial resolution is CIF; hence, the base layer (level 1) has 
QCIF resolution. For motion estimation and compensation, 
blocks of 32 x 32 are matched on the refinement level; thus, 
the block size on level 1 is 16 x 16. Both coders run at the 
same bit rate of 96 kbit/s at 5 frames/s (fps), and since the 
coder control assigns a constant bit budget to each frame, the 
coding results can be compared. The first frame (1) has been 
coded with 48 kbits. 

In the nonscalable coder, the prediction for the "base level" 
is obtained by reducing the prediction of the "refinement 
level," according to (12). This is equivalent to a reduction of 
the reconstructed refinement layer to obtain the reconstructed 
base layer frame. The prediction gains measured in decibels 
for the nonscalable and the scalable codec in Table I are 
comparable. 

4 The sequences are obtained from the original CCIR sequences using least 
squares cubic spline filters. 



TABLE I 
PSNR in Decibels of the Resolution Levels After Motion Compensation 

Frame   Level   Nonscalable   Scalable 
6       0       26.9          27.1 
6       1       27.7          27.9 
11      0       28.7          28.5 
11      1       30.2          29.8 
16      0       29.8          29.2 
16      1       31.6          30.9 


The diagrams in Fig. 6 show the overall coding performance 
for the sequence silent, coded at 96 kbit/s and 5 fps with 
both a scalable coder as well as a nonscalable coder. The 
base layer DFD is decomposed into four levels. For motion 
compensation, blocks of 8 x 8 on the base layer and 16 x 
16 on the refinement layer have been used. The pyramid for 
coding the base layer vector field has three levels. 

The PSNR of the base layer compared to the coding 
performance of a nonscalable coder (QCIF) running at the 
same average bit rate as the scalable coder (56 kbit/s) is given 
in Fig. 6(a). Except for the first frames, the performance is 
similar, as expected. For the refinement layer, the PSNR values 
are shown in Fig. 6(b). The comparison with the nonscalable 
coder running at the same total bit rate of 96 kbit/s indicates 
an upper bound. Furthermore, the results are compared with 
the simulcast case. The average bit rate of 40 kbit/s devoted 
to the refinement layer in the scalable coder is used to code 
the CIF sequence with the nonscalable coder independently. 

With the help of the following diagrams, the coder is analyzed in 
more detail. Regarding the rate allocation shown in Fig. 6(c), 
the bit rate per frame is absolutely constant, while a varying 
part is allocated by the base layer. The difference is the 
rate available for the refinement layer. With only about 40 
kbit/s, the additional refinement layers can be transmitted at a 
sufficient quality. As an example, the coded refinement layer 
image 61 is shown in Fig. 7. 

The diagram in Fig. 8(a) shows solely the motion compen- 
sation performance of the base layer compared to a nonscalable 
QCIF coder. Both coders run in a closed loop at a fixed 
bit rate including update coding, but only the PSNR after 
motion compensation before update coding is shown. As 
expected, the gain is similar. In Fig. 8(b), the performance 
of the refinement layer compared to a nonscalable CIF coder 
running at 96 kbit/s is given. Although the curves differ due 
to different bit allocations, the overall motion compensation 
gain is comparable. 

As is shown in Fig. 9, coding the CIF sequence container 
at 48 kbit/s and 5 fps with the scalable coding scheme also 
results in a gain compared to the simulcast case, where the 
refinement level coder runs at 25 kbit/s. 

VIII. Characteristics of the Proposed Coding Method 

The proposed scheme for a spatially scalable coder relies 
on a predictive motion-compensated approach. In contrast to 
spatiotemporal coding methods which utilize correlations of 
more than two frames (e.g., 3-D subband coding or motion 
estimation with more than one frame), such coders do not nec- 
essarily introduce a delay of more than one frame. The update 
information (the coded DFD) is calculated on a frame-by-frame 
basis, thus retaining this low-delay property. Therefore, 
the proposed coder is especially suited for communications 
applications. 

Fig. 6. Coding performance of the scalable coder for the sequence silent at 5 fps. (a) PSNR of the base layer compared to the nonscalable coder running 
at the same bit rate (56 kbit/s). (b) PSNR of the refinement layer compared to the nonscalable coder running at the same bit rate (96 kbit/s) and the 
simulcast case (nonscalable CIF coder at 40 kbit/s). 

One key feature is the decomposition of the displaced 
frame differences into a Laplacian pyramid. In connection with 
overlap block motion compensation, the coder produces no 
blocking artifacts. The artifacts are similar to the artifacts of 
subband coders and are dependent on the used filters. 

There is no need for a special start-up procedure since rough 
(low-pass) approximations of large areas are possible with a 
constrained bit rate. On the other hand, if most parts of the 



DFD are close to zero, the available bit rate can be spent 
in small regions, thus allowing an update of fine details. This 
property simplifies coder control considerably after a scene cut. 
Furthermore, there is no need for an explicit inter/intraswitch 
because the Laplacian pyramid is suited for static image 
statistics as well. 

If only still images are transmitted (or if there is only slight 
motion present), the reconstructed image at the receiver will 
converge to a perfect reconstruction. This means that still 
images are automatically progressively coded (see Appendixes 
A-B). 

Fig. 6. (Continued.) Coding performance of the scalable coder for the sequence silent at 5 fps. (c) Rate allocation for the base layer and for the complete frame. 

Fig. 7. Frame 61 of test sequence silent coded with the scalable coder (CIF, 96 kbit/s, 5 fps). (a) Coded refinement layer of frame 61 at 40 kbit/s. (b) Close up. 



The DFD coder outputs an embedded bit stream, which 
eases coder control and adds flexibility. The sequence coder 
control may truncate the bit stream at any point (usually when 
the available bit budget is exhausted, or the buffer exceeds a 
given threshold). Thus, the coder has the ability to code frames 
at a constant bit rate, eliminating the need for a buffer, and 
thus reducing the delay. Since there is no separate encoding of 
address information and quantized data, the algorithm works 
efficiently in a wide range of bit rates. 

IX. Discussion and Summary 

A new spatially scalable predictive video coding scheme is 
described in this paper. The basic feature is a predictive coding 
approach for the base layer as well as for the refinement layer, 
such that the displaced frame differences can be decomposed 
into a single Laplacian pyramid. 

One major aspect of the paper is the specific design of 
motion estimation and compensation. Motion compensation 
is performed such that the displaced frame difference of the 
base layer becomes similar to the reduced prediction error 
of the refinement layer. Therefore, motion is estimated and 
compensated on interpolated frames approximating the higher 
resolution level in the least squares sense. Centered cubic 
spline filters are chosen as the interpolation filters. The 
relation of the motion-compensated predictions on the base 
and the refinement layer determines the complete coder 
performance. Further research is directed to the improvement 
of the robustness when the predictions differ significantly. 

Motion is estimated in a hierarchical fashion, taking into 
account the motion vector field of the next lower resolution 
level. Since the difference between levels represents additional 
refinement information, an approach similar to the one used for 
coding the displaced frame differences is designed to encode the 
vector fields efficiently, using Laplacian pyramid coding with 
conditioning contexts. 

The second main aspect is the design of an improved 
Laplacian pyramid decomposition of the base layer. It turns 
out that centered cubic spline filters are among the best linear 
filters fulfilling the desired design objectives. Interestingly, 
the same filters meet the constraints for scalable motion 
compensation as well as for efficient pyramid decomposition. 
Furthermore, a rate-distortion efficient embedded DFD coding 
method using conditioning contexts for coding is developed, 
which allows for progressive transmission and simplifies 
coder control significantly. Simulation results are provided for 
the complete hybrid video coder. 

Fig. 8. Motion compensation performance. (a) Motion compensation gain of the base layer compared to a nonscalable QCIF coder. (b) Motion compensation gain of the refinement layer compared to a nonscalable CIF coder. 

Fig. 9. Coding performance of the scalable coder for the sequence container at 5 fps. 

Appendix A 
Analysis of the Quantization Errors 

A. Open-Loop Quantization 

The effects of quantizing the Laplacian DFD pyramid L in 
an open loop are analyzed in the following. In an open-loop 
approach, the quantization error $q^{(k)}$ at level $k$ is independent 
of the quantization errors at other levels $i \neq k$ 

(36) 

The coded displaced frame difference at level $k$ is then 
given by 

$$\hat d^{(k)} = \hat l^{(k)} + \sum_{i=k+1}^{L-1} E^{i-k}\big(\hat l^{(i)}\big) = l^{(k)} + q^{(k)} + \sum_{i=k+1}^{L-1} E^{i-k}\big(l^{(i)}\big) + \sum_{i=k+1}^{L-1} E^{i-k}\big(q^{(i)}\big).$$

With the convention $E^0(x) = x$, this leads to 

$$\hat d^{(k)} = d^{(k)} + \sum_{i=k}^{L-1} E^{i-k}\big(q^{(i)}\big). \tag{37}$$

Thus, the quantization error $q'^{(k)}$ of the displaced frame 
difference at level $k$ is given by 

$$q'^{(k)} := \hat d^{(k)} - d^{(k)} = \sum_{i \ge k} E^{i-k}\big(q^{(i)}\big). \tag{38}$$

Based on this relation, the quantization error for the coded 
frame is given by 

$$\hat g^{(k)} = g^{(k)} + q'^{(k)}. \tag{39}$$

Therefore, the difference between the coded refinement image 
(level 0) and the coded and expanded base layer (level 1) is 
affected only by the quantization error of the refinement layer 

$$\hat g^{(0)} - E\big(\hat g^{(1)}\big) = g^{(0)} + q'^{(0)} - E\big(g^{(1)} + q'^{(1)}\big) = g^{(0)} - E\big(g^{(1)}\big) + q^{(0)}. \tag{40}$$
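
The error propagation in (38) and (40) can be checked numerically. The sketch below is only an illustration under stated assumptions: the expand operator is replaced by pixel replication (any linear interpolator gives the same identities), the quantizer is a uniform integer quantizer, and the three DFD levels are arbitrary made-up integers. All names are hypothetical.

```python
def expand(x, times=1):
    """Stand-in expand operator E (applied `times` times): pixel replication."""
    for _step in range(times):
        x = [v for v in x for _copy in range(2)]
    return x


def add(a, b):
    return [p + q for p, q in zip(a, b)]


def sub(a, b):
    return [p - q for p, q in zip(a, b)]


def quantize(x, step):
    """Uniform quantizer: round each integer sample to a multiple of `step`."""
    return [step * ((v + step // 2) // step) for v in x]


def open_loop(d, step):
    """d[k] is the DFD at level k (0 = finest). Returns coded DFDs and errors q[k]."""
    top = len(d) - 1
    lap = [sub(d[k], expand(d[k + 1])) for k in range(top)] + [d[top]]
    lap_hat = [quantize(level, step) for level in lap]
    q = [sub(coded, level) for coded, level in zip(lap_hat, lap)]
    d_hat = [None] * len(d)
    d_hat[top] = lap_hat[top]
    for k in range(top - 1, -1, -1):
        d_hat[k] = add(lap_hat[k], expand(d_hat[k + 1]))
    return d_hat, q


d = [[7, -2, 5, 9, -4, 3, 1, 8], [4, 6, -1, 5], [5, 2]]  # made-up DFD levels
d_hat, q = open_loop(d, step=4)

# (38) at level 0: accumulated error is the sum of the expanded per-level errors.
q_prime_0 = sub(d_hat[0], d[0])
expected_0 = add(add(q[0], expand(q[1], 1)), expand(q[2], 2))

# Analogue of (40): only q[0] separates the coded from the uncoded difference.
coded_diff = sub(d_hat[0], expand(d_hat[1]))
plain_diff = add(sub(d[0], expand(d[1])), q[0])
```

Because all quantities are integers, both identities hold exactly in this sketch.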

B. Quantization Error Feedback 

Furthermore, the quantization error feedback due to the 
predictive coding structure can be analyzed. To this end, we 
assume no motion compensation (all displacement vectors are 
zero). Hence, the prediction is just the previously coded frame 
$\hat g_{t-1}^{(k)}$. The refinement level of the Laplacian pyramid 
to be coded 

$$l_t^{(0)} = \big(g_t^{(0)} - \hat g_{t-1}^{(0)}\big) - E\big(g_t^{(1)} - \hat g_{t-1}^{(1)}\big)$$

depends on the reconstructed previous base and refinement 
level. Using (40) reveals 

$$l_t^{(0)} = g_t^{(0)} - E\big(g_t^{(1)}\big) - \big(g_{t-1}^{(0)} - E\big(g_{t-1}^{(1)}\big)\big) - q_{t-1}^{(0)} \tag{41}$$

that $l_t^{(0)}$ is affected only by the quantization error of 
this level. No shift of quantization errors across scales occurs. 
Therefore, coding of still images and unchanged regions 
between subsequent frames converges to lossless coding. 

Appendix B 

Derivation of the Reduce and Expand Operators 
In this Appendix, reduce and expand operators are derived, 
which are based on spline approximations. The derivation 
follows in its main aspects [15], [9], and is given in terms 
of discrete one-dimensional operators. For construction of the 
pyramids used for motion estimation and for DFD coding, 
these operators are extended to the 2-D case by separable 
application along the rows and columns of the frames. 

The main difference between [9] and our approach is the 
centering of the lower resolution grid with respect to the 
high-resolution grid. This yields several advantages compared 
to the classical approach. First, the centering leads to a 
quadtree-like topology of the pyramid. Thus, each pixel on a 
coarse pyramid level corresponds to four pixels of the next 
finer level, which is advantageous for the propagation of 
block-based motion vector fields across the differing 
resolutions. Second, consistent boundary conditions are 
obtained by mirrored extension of the finite signals, while the 
uncentered pyramid requires periodic extension, which degrades 
the coding performance at signal support boundaries. 

A. Discrete Least Squares Approximation 

We consider the construction of a dyadic pyramid in a purely 
discrete framework. The basic operation is the approximation 
from a fine space $S_1$ onto a coarse space $S_2$ at twice the 
scale. For convenience, we focus first on sequences of infinite 
length. Implications of the finite support of the signals are 
discussed in the following subsection. Thus, we set $S_1 = \ell^2$ 
(the space of square-summable sequences), and consider the 
coarser subspace $S_2 \subset \ell^2$ generated from the even integer 
translates of a generating sequence $h$: 

$$S_2 = \operatorname{span}\{h[k - 2l]\}_{l \in \mathbb{Z}} = \Big\{ s_1[k] = \sum_{l \in \mathbb{Z}} s_2[l]\, h[k - 2l],\ s_2 \in \ell^2 \Big\}. \tag{42}$$

We may think of $s_2[k]$ as being samples of some pyramid 
level. Then $s_1[k]$ are the samples of the corresponding 
interpolated (expanded) level. We therefore define the 
(one-dimensional) expand operator $E$ as 

$$E : \ell^2 \to \ell^2; \quad s_2 \mapsto E(s_2) = [s_2]_{\uparrow 2} * h \tag{43}$$

and $S_2$ becomes simply the image of $E$ 

$$S_2 = \{ s_1 \mid s_1 = E(s_2),\ s_2 \in \ell^2 \}. \tag{44}$$

Note that the generating sequence $h$ has a twofold meaning 
here: 1) it generates the Riesz basis $\{h[k - 2l]\}_{l \in \mathbb{Z}}$ of $S_2$, 
and hence serves as a scaling function, and 2) it equals the 
impulse response of the synthesis filter, thus defining the 
expand operator. 

The reduce operator $R$ consists of an antialiasing filter 
(or analysis filter) with impulse response $\mathring h$ followed by a 
subsampler: 

$$R : \ell^2 \to \ell^2; \quad s_1 \mapsto s_2 = R(s_1) = [s_1 * \mathring h]_{\downarrow 2}. \tag{45}$$

The concatenation of $R$ and $E$ approximates any sequence 
$s_1 \in \ell^2$ with a sequence $\tilde s_1 = E(R(s_1)) \in S_2$. A prefilter $\mathring h$ 
is called optimal in the least squares sense with respect to a 
synthesis filter $h$ when the energy $\sum_{k \in \mathbb{Z}} (\tilde s_1[k] - s_1[k])^2$ of the 
approximation error becomes minimal. 
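
The expand and reduce operators of (43) and (45) can be sketched for finite sequences as follows. This is a minimal illustration, not the filters used in the coder: it assumes the degree-zero (piecewise constant) case with synthesis taps `[1, 1]` and analysis taps `[0.5, 0.5]`, causal indexing, and an even-length input. In `reduce`, the taps are applied as a correlation, so `taps[m]` plays the role of the analysis impulse response at index `-m`; the function names are hypothetical.

```python
def expand(s2, h):
    """Upsample by two and filter with h: s1[k] = sum_l s2[l] * h[k - 2l]."""
    out = [0.0] * (2 * len(s2) + len(h) - 2)
    for l, value in enumerate(s2):
        for m, tap in enumerate(h):
            out[2 * l + m] += value * tap
    return out


def reduce(s1, taps):
    """Filter and subsample by two: s2[l] = sum_m taps[m] * s1[2l + m]."""
    return [
        sum(tap * s1[2 * l + m] for m, tap in enumerate(taps) if 2 * l + m < len(s1))
        for l in range(len(s1) // 2)
    ]


synthesis = [1.0, 1.0]   # piecewise constant interpolation (assumed example)
analysis = [0.5, 0.5]    # averaging of sample pairs (assumed example)

s1 = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0]
s2 = reduce(s1, analysis)              # coarse level
approx = expand(s2, synthesis)         # approximation of s1 in the coarse space
energy = sum((a - b) ** 2 for a, b in zip(approx, s1))
```

For this choice each coarse sample is the mean of a pair of fine samples, which is the least squares fit of a constant to that pair.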

Theorem 2: The optimal prefilter with respect to a synthesis 
filter with z transform $H(z)$ is given by 

$$\mathring H(z) = \frac{2 H(z^{-1})}{H(z) H(z^{-1}) + H(-z) H(-z^{-1})}. \tag{46}$$

Proof: The least squares approximation is achieved when 
the error $s_1[k] - \tilde s_1[k]$ is orthogonal to $S_2$ or, equivalently, 

$$\langle h[k - 2l],\, s_1[k] \rangle = \Big\langle h[k - 2l],\, \sum_{n \in \mathbb{Z}} s_2[n]\, h[k - 2n] \Big\rangle, \quad \forall l \in \mathbb{Z}.$$

With $h^T[k] = h[-k]$, we can rewrite the inner products as 
convolutions 

$$[h^T * s_1]_{\downarrow 2} = [h^T * h]_{\downarrow 2} * s_2.$$

Application of the inverse of $[h^T * h]_{\downarrow 2}$ on both sides (which 
exists because we require $h$ to constitute a Riesz basis) yields 

$$s_2 = \big([h^T * h]_{\downarrow 2}\big)^{-1} * [h^T * s_1]_{\downarrow 2} = [\mathring h * s_1]_{\downarrow 2}.$$

Thus, the optimal prefilter has the impulse response $\mathring h = 
\big[\big([h^T * h]_{\downarrow 2}\big)^{-1}\big]_{\uparrow 2} * h^T$, whose z transform is given by (46). ■ 

Theorem 3: Least squares pyramids have the property that 
reduction of an expanded signal leaves the signal unchanged. 

Proof: By definition, the reduce operator $R$ of a least 
squares pyramid is chosen such that the energy of the 
approximation error is minimized. This being true, the 
concatenated operator 

$$P : \ell^2 \to S_2, \quad s \mapsto \tilde s = E(R(s))$$

must be an orthogonal projector from $\ell^2$ into $S_2$. This implies 
$P(P(s)) = P(s)$. For the reduce and expand operators, this 
means that $E(R(E(R(s)))) = E(R(s))$, and 

$$R(E(s)) = s, \quad \forall s \in \ell^2 \tag{47}$$

follows immediately. ■ 

This property is crucial for the scalable coding scheme 
described in this paper. 
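
Theorems 2 and 3 can be checked numerically on periodic signals, where the z transform on the unit circle becomes a DFT. The sketch below rests on several assumptions that are not part of the paper: signals are periodic with even length N, the synthesis kernel is a linear interpolation kernel (0.5, 1, 0.5) rather than the cubic spline filters of the coder, and a slow textbook DFT is used to stay self-contained. On the DFT grid, H(1/z) is the complex conjugate of H(z) for real h, and H(-z) is H shifted by N/2 bins.

```python
import cmath


def dft(x):
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * w * k / n) for k in range(n))
            for w in range(n)]


def idft_real(X):
    n = len(X)
    return [sum(X[w] * cmath.exp(2j * cmath.pi * w * k / n) for w in range(n)).real / n
            for k in range(n)]


def optimal_prefilter(h):
    """Impulse response of the prefilter (46), evaluated on the DFT grid."""
    n = len(h)
    H = dft(h)
    ring = [2 * H[w].conjugate() / (abs(H[w]) ** 2 + abs(H[(w + n // 2) % n]) ** 2)
            for w in range(n)]
    return idft_real(ring)


def cconv(a, b):
    """Circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[m] * b[(k - m) % n] for m in range(n)) for k in range(n)]


def reduce(s1, prefilter):
    return cconv(s1, prefilter)[::2]      # filter, then keep even samples


def expand(s2, h):
    up = [0.0] * (2 * len(s2))
    up[::2] = s2                          # insert zeros between samples
    return cconv(up, h)


N = 8
h = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5]   # h[-1], h[0], h[1] = 0.5, 1, 0.5
h_ring = optimal_prefilter(h)

s1 = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, 6.0]
approx = expand(reduce(s1, h_ring), h)
error = [a - b for a, b in zip(s1, approx)]

# Theorem 2: the error is orthogonal to every even translate of h.
inner = [sum(h[(k - 2 * l) % N] * error[k] for k in range(N)) for l in range(N // 2)]

# Theorem 3: reducing an expanded signal returns the signal, as in (47).
s2 = [1.0, -2.0, 0.5, 3.0]
round_trip = reduce(expand(s2, h), h_ring)
```

The same check applies to any other kernel whose denominator in (46) does not vanish on the DFT grid.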

B. Centered Spline Approximation 

Until now, we have not specified the space $S_2$ and the 
filter $h$. We confine ourselves to spline pyramids mainly 
for two reasons. First, splines have excellent approximation 
properties [19], [20]. Second, the spline framework allows 
for a progressive transition between the piecewise constant 
and the band-limited model. The spline of degree zero leads 
to the Haar wavelet (best time localization), and the cardinal 
spline filters converge to the sinc interpolator for high degrees 
[21]. So it can be expected that a reasonable tradeoff between 
time localization and aliasing can be found inside the class of 
spline filters. 

Usually, the generating kernel $h$ is symmetric, and the 
coarser grid points are positioned on the even integers. We 
propose instead to shift the sampling grid of the approximating 
function such that the samples "sit" centered between the 
samples of the high-resolution grid. Using the formalism 
developed in [15], it is not difficult to derive the z transform 
$H_\Delta(z)$ of the cardinal spline interpolator corresponding to a 
shift $\Delta$ 



(48) 



where the sampled B-spline filter kernels $B^n_{m,\Delta}(z)$ are defined 
as 



B n mA (z)=Y,P n {— \~ k (««) 





Fig. 10. Odd and even mirroring of a discrete signal. (a) Odd mirroring in 
case of a traditional pyramid. (b) Even mirroring in case of a centered pyramid. 



and $\beta^n(x)$ is the symmetrical B-spline of degree $n$, defined by 

$$\beta^n(x) = \beta^0(x) * \beta^{n-1}(x)$$

with 

$$\beta^0(x) = \begin{cases} 1 & \text{for } |x| < 1/2 \\ 0 & \text{for } |x| > 1/2. \end{cases}$$
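
To make the effect of shifting the sampling grid concrete, the following sketch evaluates the cubic B-spline on the integer grid and on a grid shifted by one half. It is only an illustration: for degrees of one and higher it uses the standard closed-form sum of truncated powers, which is equivalent to the repeated convolution above but is not stated in this paper, and the function name is hypothetical. The value at |x| = 1/2 for degree zero is set to zero here as an arbitrary convention.

```python
from fractions import Fraction
from math import comb, factorial


def bspline(n, x):
    """Centered B-spline of degree n at x, in exact rational arithmetic."""
    x = Fraction(x)
    if n == 0:
        # Box function; the value exactly at |x| = 1/2 is a convention (assumed 0).
        return Fraction(1) if abs(x) < Fraction(1, 2) else Fraction(0)
    total = Fraction(0)
    for j in range(n + 2):
        t = x + Fraction(n + 1, 2) - j
        if t > 0:  # truncated power: only positive arguments contribute
            total += (-1) ** j * comb(n + 1, j) * t ** n
    return total / factorial(n)


# Cubic B-spline sampled on the integers (unshifted grid).
integer_samples = [bspline(3, k) for k in range(-1, 2)]

# Cubic B-spline sampled halfway between the integers (centered grid).
centered_samples = [bspline(3, Fraction(2 * k + 1, 2)) for k in range(-2, 2)]
```

The unshifted samples form an odd-length symmetric kernel, while the centered samples form an even-length symmetric kernel; both sum to one.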



For the shift parameter $\Delta = 0$, the pyramid equals the one 
derived in [9], and corresponds to the traditional formulation of 
the discrete wavelet transform. For $\Delta = 1$, the samples of the 
coarse grid sit on the odd integers, and for $\Delta = 1/2$, we end up 
with the centered pyramid which we use for the compression 
algorithm. In Fig. 10, the positioning of the sampling grids 
is depicted for the traditional ($\Delta = 0$) and the centered 
($\Delta = 1/2$) case. The centered pyramid allows us to assign 
each sample on the fine grid the closest sample on the coarse 
grid, whereas this is possible in the traditional pyramid only 
for the even samples. 
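
The assignment of fine samples to coarse samples can be sketched in a few lines. The example assumes that coarse sample l sits at position 2l plus the shift on the fine grid, which is consistent with the three cases named above; the function name and the grid sizes are hypothetical.

```python
from fractions import Fraction


def nearest_coarse(k, shift, num_coarse):
    """Indices of the coarse samples closest to fine sample k (ties give several)."""
    distances = [abs(Fraction(k) - (2 * l + shift)) for l in range(num_coarse)]
    best = min(distances)
    return [l for l, dist in enumerate(distances) if dist == best]


fine, coarse = 8, 4

# Centered pyramid: every fine sample has exactly one closest coarse sample.
centered = [nearest_coarse(k, Fraction(1, 2), coarse) for k in range(fine)]

# Traditional pyramid: odd fine samples lie halfway between two coarse samples.
traditional = [nearest_coarse(k, Fraction(0), coarse) for k in range(fine)]
```

In the centered case the unique parent of fine sample k is `k // 2`, which is the one-dimensional counterpart of the quadtree topology.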

Until now, only infinite signals have been considered. In 
the case of finite signals, we have to impose suitable boundary 
conditions. In signal processing, such boundary conditions are 
usually formulated in terms of signal extensions across the 
boundaries. It is easy to show that in the case of the uncentered 
spline approximation, consistent boundary conditions are given 
by periodic repetition of the signals. However, this introduces 
unnatural discontinuities at the boundaries, which degrade the 
performance of most signal processing applications. Therefore, 
a symmetric extension by mirroring is preferred in many 
applications, as depicted in Fig. 11 for continuous signals. 
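
The difference between the extensions can be seen on a short discrete signal. In this sketch the even mirroring follows the index rule s[2K - 1 - k] used in this appendix; the odd mirroring rule s[2K - 2 - k] is the usual whole-sample reflection and is an assumption about what Fig. 10(a) depicts. Function names and sample values are hypothetical.

```python
def extend_periodic(s):
    """Periodic repetition: append one further period."""
    return s + s


def extend_even_mirror(s):
    """Even mirroring: s[k] = s[2K - 1 - k], the boundary sample is repeated."""
    K = len(s)
    return s + [s[2 * K - 1 - k] for k in range(K, 2 * K)]


def extend_odd_mirror(s):
    """Odd mirroring (assumed rule): s[k] = s[2K - 2 - k], boundary sample not repeated."""
    K = len(s)
    return s + [s[2 * K - 2 - k] for k in range(K, 2 * K - 1)]


def boundary_jump(extended, K):
    """Magnitude of the step across the original signal boundary."""
    return abs(extended[K] - extended[K - 1])


s = [1, 2, 4, 8]
periodic = extend_periodic(s)
even = extend_even_mirror(s)
odd = extend_odd_mirror(s)
```

For this signal the periodic extension jumps from 8 to 1 at the boundary, whereas the even mirroring continues without a step.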

Fig. 11. Symmetric boundary extension of a continuous signal with finite 
support. 

If the length $K$ of the signal is even, a proper symmetric 
extension is impossible as long as we stay with the uncentered 
grid. On the other hand, if we use a centered grid as proposed 
in this paper, we can use the "even" mirroring 

$$\bar s[k] = \begin{cases} s[k], & k = 0, \dots, K - 1 \\ s[2K - 1 - k], & k = K, \dots, 2K - 1 \end{cases}$$

without distorting property (47), 